Economics

1 Dimension Overview

The diagram below shows a total of 45 indicators within the economics dimension, with indices labeled. 17 indicators are included in the Wiltshire et al. framework and are also being studied by one or more teams (red), 9 are included in the Wiltshire et al. framework but are not currently being studied (green), and 19 were not in the original framework but have been added by one or more teams (blue).

The points beside each indicator name represent the number of secondary data metrics that have been aggregated for each indicator. Sources include USDA NASS, BLS, ERS, Census Bureau, and others. The quality and appropriateness of these metrics vary widely - having more of them does not necessarily mean an indicator is better represented. For more information on the data sources, see Section 2 below.

One other point to note here is that I removed several dozen metrics from BLS wage labor data broken down by NAICS industry code so as not to inflate that indicator relative to the others.

Code
## Load packages
pacman::p_load(
  ggraph,
  igraph,
  dplyr,
  RColorBrewer,
  viridisLite,
  ggrepel,
  stringr
)

conflicted::conflicts_prefer(
  dplyr::as_data_frame(),
  .quiet = TRUE
)

## Load data for tree and metrics
dat <- readRDS('data/trees/econ_tree.rds') %>% 
  select(Dimension:Source)
metadata_all <- readRDS('data/sm_data.rds')[['metadata']]
meta <- metadata_all %>% 
  filter(
    dimension == 'economics'
  )

# Rename metadata so it fits into formatting of tree data
# This is not ideal - note to self: harmonize this properly later
meta <- meta %>% 
  mutate(
    indicator = str_to_sentence(indicator),
    indicator = case_when(
      str_detect(indicator, '^Assets') ~ 'Balance sheet (assets and liabilities)',
      str_detect(indicator, '^Business failure') ~ 'Business failure rate of food business',
      str_detect(indicator, '^Direct') ~ '% direct-to-consumer sales',
      str_detect(indicator, '^Job avail') ~ 'Availability of good-paying jobs in food systems',
      str_detect(indicator, '^Local sales') ~ '% local sales',
      str_detect(indicator, '^Operator salary') ~ 'Operator salary / wage',
      str_detect(indicator, '^Total sales') ~ 'Total sales / revenue',
      str_detect(indicator, '^Wealth/income') ~ 'Wealth / income distribution',
      TRUE ~ indicator
    )
  ) 

# Join counts of secondary data metrics to original dataset
# Remove the NAICS variables - there are many of them and they add little
counts <- meta %>% 
  filter(str_detect(variable_name, '^lq|lvl|Lvl|Naics', negate = TRUE)) %>% 
  group_by(indicator) %>% 
  dplyr::summarize(count = n())


## Make edges
# include groupings by dimension, then combine them
edges <- list()
edges$dim_ind <- dat %>% 
  select(Dimension, Index) %>% 
  unique() %>% 
  dplyr::rename(from = Dimension, to = Index) %>% 
  mutate(group = to)
edges$ind_ind <- dat %>% 
  select(Index, Indicator) %>% 
  unique() %>% 
  dplyr::rename(from = Index, to = Indicator) %>% 
  mutate(group = from)
edges <- bind_rows(edges)

# Overwrite group with indicator source (used to color labels and points)
edges$group <- c(rep(NA, 10), dat$Source)


## Make vertices
# Each line is a single vertex (dimension, index, or indicator)
# Point size will come from the count of secondary metrics per indicator
vertices = data.frame(
  name = unique(c(as.character(edges$from), as.character(edges$to)))
) %>% 
  left_join(counts, by = join_by(name == indicator)) %>% 
  dplyr::rename('value' = count)

# Add the dimension groupings to the vertices as well
vertices$group = edges$group[match(vertices$name, edges$to)]

# Calculate the angles to arrange indicator labels
vertices$id = NA
myleaves = which(is.na(match(vertices$name, edges$from)))
nleaves = length(myleaves)
vertices$id[myleaves] = seq_len(nleaves)
vertices$angle = 90 - 360 * vertices$id / nleaves

# Calculate alignment of indicator labels
vertices$hjust <- ifelse(vertices$angle < -90, 1, 0)

# Flip label angles around 180 degrees if they are facing the wrong way
vertices$angle <- ifelse(vertices$angle < -90, vertices$angle + 180, vertices$angle)


## Create graph
# Make ggraph object from edges and vertices
graph <- graph_from_data_frame(edges, vertices = vertices)

# Plot the graph
ggraph(graph, layout = 'dendrogram', circular = TRUE) +
  
  # Color edges by dimension
  geom_edge_diagonal(color = 'black', width = 0.5) +
  
  # Create text for indicators using angles, hjust, and dimension groupings
  geom_node_text(
    aes(
      x = x * 1.15,
      y = y * 1.15,
      filter = leaf,
      label = name,
      angle = angle,
      hjust = hjust,
      colour = group
    ),
    size = 3,
    alpha = 1
  ) +
  
  # Label indices within graph
  geom_label_repel(
    aes(
      x = x,
      y = y,
      label = ifelse(name %in% unique(dat$Index), name, NA)
    ),
    label.padding = unit(0.15, "lines"),
    label.r = unit(0.3, "lines"),
    label.size = 0.05,
    size = 2.25,
    force = 0.1,    
    force_pull = 1, 
    max.overlaps = 10 
  ) +
  
  # Make the points for indicators based on secondary metric count
  geom_node_point(
    aes(
      filter = leaf,
      x = x * 1.07,
      y = y * 1.07,
      colour = group,
      size = value
    ),
    alpha = 0.4
  ) +
  
  # Various formatting options
  # scale_size_continuous(range = c(0.1, 7)) +
  theme_void() +
  theme(
    plot.margin = unit(c(0, 0, 0, 0), "cm")
  ) +
  scale_colour_manual(
    name = "Indicator Use",
    values = brewer.pal(3, 'Set1'),
    labels = c("Both", "Current Only", "Wiltshire Only")
  ) +
  expand_limits(x = c(-2.5, 2.5), y = c(-2.5, 2.5))

Radial dendrogram of Sustainability Metrics framework

2 Metadata Table

This is a table for exploring metadata for the secondary metrics data. It does not include the values or geographic areas themselves, but it does include definitions, sources, and links to the data used.

Using the table:

  • Click column headers to sort
  • Global search in the top right, or column search in each header
  • Change page length and page through results at the bottom
  • Use the download button to download a .csv file of the filtered table
  • Click the arrow on the left of each row for details, including a URL to the data source.
Code
pacman::p_load(
  dplyr,
  reactable,
  stringr,
  htmltools
)

# Load full metadata table
metadata_all <- readRDS('data/sm_data.rds')[['metadata']]

# Pick out variables to display
metadata <- metadata_all %>% 
  select(
    metric,
    'Variable Name' = variable_name,
    definition,
    dimension,
    index,
    indicator,
    units,
    'Year' = latest_year, # Rename latest_year to Year; original year column not included
    source,
    scope,
    resolution,
    url
) %>% 
  setNames(c(str_to_title(names(.))))


###
htmltools::browsable(
  tagList(
    
    tags$div(
      style = "display: flex; gap: 16px; margin-bottom: 20px; justify-content: center;",
      
      tags$button(
        class = "btn btn-primary",
        style = "display: flex; align-items: center; gap: 8px; padding: 8px 12px;",
        tagList(fontawesome::fa("eye"), "Show/hide more columns"),
        onclick = "Reactable.setHiddenColumns('metadata_table', prevColumns => {
          return prevColumns.length === 0 ? ['Definition', 'Scope', 'Resolution', 'Url'] : []
        })"
      ),
      
      tags$button(
        class = "btn btn-primary",
        style = "display: flex; align-items: center; gap: 8px; padding: 8px 12px;",
        tagList(fontawesome::fa("download"), "Download as CSV"),
        onclick = "Reactable.downloadDataCSV('metadata_table', 'sustainability_metadata.csv')"
      )
    ),
    
    reactable(
      metadata,
      sortable = TRUE,
      resizable = TRUE,
      filterable = TRUE,
      searchable = TRUE,
      pagination = TRUE,
      bordered = TRUE,
      wrap = TRUE,
      rownames = FALSE,
      onClick = 'select',
      striped = TRUE,
      pageSizeOptions = c(5, 10, 25, 50, 100),
      defaultPageSize = 5,
      showPageSizeOptions = TRUE,
      highlight = TRUE,
      style = list(fontSize = "14px"),
      compact = TRUE,
      columns = list(
        Metric = colDef(
          minWidth = 200,
          sticky = 'left'
        ),
        'Variable Name' = colDef(
          minWidth = 150
        ),
        Definition = colDef(
          minWidth = 250
        ),
        Year = colDef(minWidth = 75),
        Source = colDef(minWidth = 250),
        Scope = colDef(show = FALSE),
        Resolution = colDef(show = FALSE),
        Url = colDef(
          minWidth = 300,
          show = FALSE
        )
      ),
      defaultColDef = colDef(minWidth = 100),
      elementId = "metadata_table",
      details = function(index) {
        div(
          style = "padding: 15px; border: 1px solid #ddd; margin: 10px 0;
             background-color: #E0EEEE; border-radius: 10px; border-color: black;
             box-shadow: 2px 2px 10px rgba(0, 0, 0, 0.1);",
          
          tags$h4(
            strong("Details")
          ),
          tags$p(
            strong('Metric Name: '), 
            as.character(metadata_all[index, 'metric'])
          ),
          tags$p(
            strong('Variable Name: '), 
            as.character(metadata_all[index, 'variable_name'])
          ),
          tags$p(
            strong('Definition: '), 
            as.character(metadata_all[index, 'definition'])
          ),
          tags$p(
            strong('Source: '), 
            as.character(metadata_all[index, 'source'])
          ),
          tags$p(
            strong('Latest Year: '), 
            as.character(metadata_all[index, 'latest_year'])
          ),
          tags$p(
            strong('All Years (cleaned, wrangled, and included): '), 
            as.character(metadata_all[index, 'year'])
          ),
          tags$p(
            strong('Updates: '), 
            str_to_title(as.character(metadata_all[index, 'updates']))
          ),
          tags$p(
            strong('URL: '), 
            tags$a(
              href = as.character(metadata_all[index, 'url']),
              target = '_blank',
              as.character(metadata_all[index, 'url'])
            )
          )
        )
      }
    )
  )
)

3 Data Table

This is another table, this one including the data values themselves. It takes some more legwork to navigate through years and counties, but all of the secondary data is included here. Again, use the button to download the filtered view of the table.

Code
pacman::p_load(
  dplyr,
  reactable,
  stringr,
  htmltools
)

# Load metrics and metadata
metadata_all <- readRDS('data/sm_data.rds')[['metadata']]
metrics <- readRDS('data/sm_data.rds')[['metrics']]
fips_key <- readRDS('data/sm_data.rds')[['fips_key']]

# Value formatting function based on units
source('dev/format_values.R')

# Filter to economics metrics, join with metadata and county fips codes
econ_metrics <- metrics %>% 
  left_join(metadata_all, by = join_by('variable_name')) %>% 
  filter(dimension == 'economics') %>% 
  left_join(fips_key, by = join_by('fips')) %>% 
  mutate(county_name = ifelse(is.na(county_name), state_name, county_name)) %>% 
  format_values() %>% 
  select(
    metric,
    'Variable Name' = variable_name,
    definition,
    year = year.x,
    Area = county_name,
    units,
    value
  ) %>% 
  setNames(c(str_to_title(names(.)))) %>% 
  filter(!is.na(Value))


## Reactable table
htmltools::browsable(
  tagList(
    
    tags$div(
      style = "display: flex; gap: 16px; margin-bottom: 20px; justify-content: center;",
      tags$button(
        class = "btn btn-primary",
        style = "display: flex; align-items: center; gap: 8px; padding: 8px 12px;",
        tagList(fontawesome::fa("download"), "Download as CSV"),
        onclick = "Reactable.downloadDataCSV('metrics_table', 'sustainability_metrics.csv')"
      )
    ),
    
    reactable(
      econ_metrics,
      sortable = TRUE,
      resizable = TRUE,
      filterable = TRUE,
      searchable = TRUE,
      pagination = TRUE,
      bordered = TRUE,
      wrap = TRUE,
      rownames = FALSE,
      onClick = 'select',
      striped = TRUE,
      pageSizeOptions = c(5, 10, 25, 50, 100),
      defaultPageSize = 5,
      showPageSizeOptions = TRUE,
      highlight = TRUE,
      style = list(fontSize = "14px"),
      compact = TRUE,
      columns = list(
        Metric = colDef(
          minWidth = 125,
          sticky = 'left'
        ),
        'Variable Name' = colDef(
          minWidth = 125
        ),
        Definition = colDef(
          minWidth = 250
        ),
        Units = colDef(minWidth = 100),
        'Year' = colDef(minWidth = 100)
      ),
      defaultColDef = colDef(minWidth = 100),
      elementId = "metrics_table"
    )
  )
)

4 Distributions

We are taking out the abundant but largely redundant BLS NAICS wage variables, leaving a more approachable set of 46 variables to explore here. First, we show univariate distributions by county.

Code
pacman::p_load(
  dplyr,
  purrr,
  ggplot2,
  rlang,
  ggpubr,
  tidyr,
  stringr
)

source('dev/data_pipeline_functions.R')
source('dev/filter_fips.R')
metrics <- readRDS('data/sm_data.rds')[['metrics']]
metadata <- readRDS('data/sm_data.rds')[['metadata']]

# Use metadata to help filter by dimension
econ_meta <- metadata %>% 
  filter(dimension == 'economics')

# Filter to economics dimension
econ_metrics <- metrics %>% 
  filter(variable_name %in% econ_meta$variable_name)

# Filter to latest year and new (post-2024) counties
# Also remove NAICS variables to leave us with an approachable number
# And pivot wider so it is easier to get correlations
econ_metrics_latest <- econ_metrics %>%
  filter_fips(scope = 'new') %>% 
  get_latest_year() %>% 
  filter(
    str_detect(
      variable_name, 
      'Naics|^lq|^avgEmpLvl|expHiredLaborPercOpExp', 
      negate = TRUE
    )
  )

# Pivot wider for easier correlations below
econ_metrics_latest <- econ_metrics_latest %>% 
  select(fips, variable_name, value) %>% 
  unique() %>% 
  mutate(variable_name = str_split_i(variable_name, '_', 1)) %>% 
  pivot_wider(
    names_from = 'variable_name',
    values_from = 'value'
  ) %>% 
  unnest(!fips) %>% 
  mutate(across(c(civLaborForce:last_col()), as.numeric))
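`get_latest_year()` comes from `dev/data_pipeline_functions.R`, which is not shown here. A minimal sketch of what such a helper presumably does, keeping only each variable’s most recent year (the function name suffix and internals are assumptions):

```r
library(dplyr)

# Hypothetical sketch - the real get_latest_year() lives in
# dev/data_pipeline_functions.R and may differ in its details
get_latest_year_sketch <- function(df) {
  df %>%
    group_by(variable_name) %>%
    filter(year == max(year, na.rm = TRUE)) %>%
    ungroup()
}
```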
Code
pacman::p_load(
  dplyr,
  purrr,
  ggplot2,
  rlang,
  ggpubr,
  tidyr
)

plots <- map(names(econ_metrics_latest)[-1], \(var){
  if (is.character(econ_metrics_latest[[var]])) {
    econ_metrics_latest %>% 
      ggplot(aes(x = !!sym(var))) + 
      geom_bar(
        fill = 'lightblue',
        color = 'royalblue',
        alpha = 0.5
      ) +
      theme_classic() +
      theme(plot.margin = unit(c(rep(0.5, 4)), 'cm'))
  } else if (is.numeric(econ_metrics_latest[[var]])) {
    econ_metrics_latest %>% 
      ggplot(aes(x = !!sym(var))) + 
      geom_density(
        fill = 'lightblue',
        color = 'royalblue',
        alpha = 0.5
      ) +
      theme_classic() +
      theme(plot.margin = unit(c(rep(0.5, 4)), 'cm'))
  } else {
    return(NULL)
  }
}) 


# Arrange them in 4 columns
ggarrange(
  plotlist = plots,
  ncol = 4,
  nrow = 12
)

Distributions of economic metrics at the county level.
Code
# ggsave(
#   'images/distribution_plots.png',
#   plot = out,
#   dpi = 150,
#   width = 1500,
#   height = 3000,
#   units = 'px'
# )

5 Correlation Heatmap

Throwing those same variables into a correlation matrix. Hover to see variable names, Pearson correlation, and p-values.

Code
pacman::p_load(
  dplyr,
  tidyr,
  tibble,
  stringr,
  purrr,
  ggplot2,
  plotly,
  reshape,
  Hmisc,
  viridisLite
)

# Arrange variables in some halfway reasonable order
cor_dat <- econ_metrics_latest %>% 
  select(
    matches('Code_|metro'),
    matches('employ|abor|Worker'),
    matches('Sales'),
    matches('Earn|Income'),
    everything(),
    -fips,
    -matches('expHiredLaborPercOpExp') # This one didn't come through
  )

# Make a correlation matrix using all the selected variables
cor <- cor_dat %>% 
  as.matrix() %>% 
  rcorr()

# Melt correlation values and rename columns
cor_r <- melt(cor$r) %>% 
  setNames(c('var_1', 'var_2', 'value'))

# Attach p values as a column so they stay aligned with the melted correlations
cor_r$p_value <- melt(cor$P)$value

# Make heatmap with custom text aesthetic for tooltip
plot <- cor_r %>% 
  ggplot(aes(var_1, var_2, fill = value, text = paste0(
    'Var 1: ', var_1, '\n',
    'Var 2: ', var_2, '\n',
    'Correlation: ', format(round(value, 3), nsmall = 3), '\n',
    'P-Value: ', format(round(p_value, 3), nsmall = 3)
  ))) + 
  geom_tile() + 
  scale_fill_viridis_c() + 
  theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
  labs(
    x = NULL,
    y = NULL,
    fill = 'Correlation'
  )

# Convert to interactive plotly figure with text tooltip
ggplotly(
  plot, 
  tooltip = 'text',
  width = 1000,
  height = 800
)

Interactive Correlation Plot

6 PCA

PCA is a popular tool in this area for exploring unique variation among many collinear variables. It is a way to reduce the dimensionality of the data into fewer, more interpretable principal components.

It also requires complete data, which we do not have. So I’m using a random forest algorithm to impute data here (Stekhoven and Bühlmann 2012). This really warrants a deeper dive into the type and severity of missingness, but I’m just going to run with it for now.
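As a minimal illustration of the imputation step, here is `missForest` run on a toy version of the built-in `iris` data with values knocked out at random (a standalone sketch, not the project pipeline):

```r
library(missForest)

set.seed(42)
# Remove 10% of values at random, then impute with random forests
iris_mis <- prodNA(iris, noNA = 0.1)  # prodNA() ships with missForest
mf <- missForest(iris_mis)

# OOB imputation error: NRMSE for numeric columns, PFC for the factor
mf$OOBerror
```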

Code
pacman::p_load(
  missForest
)

# Wrangle dataset. Need all numeric vars or factor vars. And can't be tibble
# Also removing character vars - can't use these in PCA
dat <- econ_metrics_latest %>%
  select(where(is.numeric)) %>%
  as.data.frame()
# get_str(dat)

# Check missing variables
# skimr::skim(dat)

# Impute missing variables
set.seed(42)
mf_out <- dat %>%
  missForest(
    ntree = 200,
    mtry = 10,
    verbose = FALSE,
    variablewise = FALSE
  )

# Save imputed dataset
imp <- mf_out$ximp

# Print OOB
mf_out$OOBerror
    NRMSE 
0.1839393 

Out-of-bag error is shown as normalized root mean square error (NRMSE). Now we can explore how many composite factors are appropriate for the data.

Code
pacman::p_load(
  psych
)
VSS(imp)


Very Simple Structure
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.68  with  2  factors
VSS complexity 2 achieves a maximimum of 0.87  with  2  factors

The Velicer MAP achieves a minimum of 0.04  with  7  factors 
BIC achieves a minimum of  -351.11  with  5  factors
Sample Size adjusted BIC achieves a minimum of  1623.83  with  8  factors

Statistics by number of factors 
  vss1 vss2   map dof chisq
1 0.58 0.00 0.099 860  4507
2 0.68 0.87 0.059 818  3743
3 0.64 0.84 0.058 777  3434
4 0.65 0.84 0.056 737  3025
5 0.59 0.83 0.044 698  2690
6 0.58 0.81 0.043 660  2582
7 0.57 0.79 0.042 623  2431
8 0.57 0.81 0.045 587  2331
  prob
  (all effectively zero, < 1e-60; full-precision printout omitted)
  sqresid  fit RMSEA  BIC SABIC complex eChisq  SRMR eCRMS  eBIC
1   109.7 0.58  0.23  760  3472     1.0   6460 0.214 0.219  2713
2    35.1 0.87  0.21  180  2759     1.3   1483 0.103 0.108 -2081
3    24.6 0.91  0.21   49  2499     1.6    920 0.081 0.087 -2465
4    18.5 0.93  0.20 -186  2137     1.7    610 0.066 0.073 -2601
5    14.1 0.95  0.19 -351  1850     1.7    397 0.053 0.060 -2644
6    11.7 0.96  0.19 -294  1787     1.9    308 0.047 0.055 -2568
7     9.4 0.96  0.19 -283  1681     2.0    214 0.039 0.047 -2500
8     8.1 0.97  0.19 -227  1624     2.0    174 0.035 0.044 -2383
Code
fa.parallel(imp)

Parallel analysis suggests that the number of factors =  5  and the number of components =  4 

The VSS output gives a wide range from 2 to 8 factors, MAP suggests 7, and parallel analysis suggests 5 factors or 4 components. I tend to trust parallel analysis the most, and since this is a PCA, let's go with 4 components.

Code
(pca_out <- pca(imp, nfactors = 4))
Principal Components Analysis
Call: principal(r = r, nfactors = nfactors, residuals = residuals, 
    rotate = rotate, n.obs = n.obs, covar = covar, scores = scores, 
    missing = missing, impute = impute, oblique.scores = oblique.scores, 
    method = method, use = use, cor = cor, correct = 0.5, weight = NULL)
Standardized loadings (pattern matrix) based upon correlation matrix
                        RC1   RC2   RC3   RC4    h2    u2 com
civLaborForce          0.22  0.90  0.19  0.02 0.901 0.099 1.2
employed               0.22  0.90  0.19  0.02 0.900 0.100 1.2
unemployed             0.22  0.91  0.16  0.10 0.921 0.079 1.2
unemploymentRate       0.12  0.13 -0.05  0.76 0.617 0.383 1.1
medHhIncome            0.07  0.34  0.83 -0.03 0.802 0.198 1.3
medHhIncomePercState   0.06  0.07  0.82 -0.16 0.702 0.298 1.1
gini                  -0.19  0.39  0.00  0.46 0.403 0.597 2.3
nCSA                   0.16  0.37 -0.10 -0.24 0.235 0.765 2.3
nFarmersMarket         0.33  0.84  0.13 -0.05 0.828 0.172 1.4
nOnFarmMarket          0.03  0.07  0.12 -0.09 0.029 0.971 2.6
agTourSalesPerc       -0.15  0.26  0.32  0.63 0.581 0.419 2.0
d2cSalesPerc          -0.24  0.41  0.52  0.10 0.501 0.499 2.4
localSalesPerc        -0.18  0.39  0.53  0.04 0.468 0.532 2.1
nAnaerDigestion        0.54 -0.37 -0.12  0.10 0.451 0.549 2.0
nCompost               0.36  0.78  0.16  0.07 0.761 0.239 1.5
nFoodHubs              0.17 -0.02  0.02 -0.12 0.045 0.955 1.8
nMeatProcess           0.16  0.80  0.07  0.05 0.669 0.331 1.1
medianEarnMaleFood    -0.01  0.23  0.09 -0.35 0.186 0.814 1.9
medianEarnFemaleFood  -0.19  0.12  0.17  0.63 0.478 0.522 1.4
womenEarnPercMaleFood -0.15 -0.15 -0.04  0.67 0.501 0.499 1.2
medianEarnMaleFarm    -0.17  0.04  0.33  0.30 0.228 0.772 2.5
medianEarnFemaleFarm  -0.11 -0.06  0.71  0.20 0.565 0.435 1.2
womenEarnPercMaleFarm -0.08  0.01  0.63  0.08 0.409 0.591 1.1
nHiredWorkers          0.92  0.29  0.01 -0.11 0.944 0.056 1.2
nOpsMigrantWorkers     0.79  0.09  0.04 -0.10 0.647 0.353 1.1
nOpsHiredLabor         0.88  0.26 -0.07 -0.25 0.902 0.098 1.4
nOpsHiredLaborExp      0.88  0.26 -0.06 -0.25 0.902 0.098 1.4
nWorkersLE150          0.88  0.23 -0.04 -0.06 0.839 0.161 1.2
nMigrantWorkers        0.60 -0.14 -0.19  0.13 0.427 0.573 1.4
nOpsWorkersLE150       0.88  0.24 -0.09 -0.24 0.900 0.100 1.3
nWorkersGE150          0.85  0.37  0.09 -0.11 0.885 0.115 1.4
nOpsWorkersGE150       0.89  0.28  0.02 -0.20 0.906 0.094 1.3
nOpsUnpaidWorkers      0.69  0.25 -0.19 -0.38 0.717 0.283 2.1
nUnpaidWorkers         0.64  0.34 -0.19 -0.35 0.697 0.303 2.4
expHiredLabor          0.95  0.14  0.05 -0.07 0.935 0.065 1.1
expHiredLaborPF        0.62 -0.08  0.25  0.15 0.472 0.528 1.5
expPF                  0.75 -0.33  0.09  0.30 0.774 0.226 1.8
farmIncomePF           0.31  0.26  0.64  0.09 0.585 0.415 1.9
acresOperated          0.70 -0.41 -0.36 -0.05 0.790 0.210 2.2
acresPF                0.32 -0.58 -0.54  0.11 0.743 0.257 2.6
medianAcresPF          0.22 -0.52 -0.62 -0.01 0.708 0.292 2.2
landValPF              0.21 -0.19  0.55  0.01 0.380 0.620 1.6
landValPerAcre        -0.14  0.47  0.47  0.21 0.513 0.487 2.6

                        RC1  RC2  RC3  RC4
SS loadings           11.00 7.46 5.19 3.20
Proportion Var         0.26 0.17 0.12 0.07
Cumulative Var         0.26 0.43 0.55 0.62
Proportion Explained   0.41 0.28 0.19 0.12
Cumulative Proportion  0.41 0.69 0.88 1.00

Mean item complexity =  1.7
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.07 
 with the empirical chi square  680.81  with prob <  0.93 

Fit based upon off diagonal values = 0.96
Code
plot(pca_out$values)
abline(h = 1)

From the scree plot and eigenvalues it looks like the first three components carry most of the unique variance, but after that there is no clear elbow where a qualitative decision can be made to choose a certain number of components. The Kaiser-Guttman rule suggests keeping any components with an eigenvalue > 1 (at the horizontal line), but we can see here that this is a rather dubious distinction.
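Because `principal()` works from the correlation matrix, the Kaiser-Guttman count can be read straight off its eigenvalues (`pca_out$values` holds the same values). A small sketch with a hypothetical helper name:

```r
# Kaiser-Guttman: count components whose correlation-matrix eigenvalue
# exceeds 1. psych's pca_out$values holds these same eigenvalues, so
# sum(pca_out$values > 1) gives the same answer for the imputed data.
kaiser_guttman_k <- function(df) {
  sum(eigen(cor(df))$values > 1)
}
```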

If we look at the output from the PCA call, we can see how closely each variable (row) correlates with each component (columns RC1-RC4). The variables most associated with Component #1 are the farm labor variables - numbers of workers, labor expenses, etc. They also tend to be raw figures, and probably have more to do with population than anything else. Component #2 is made up mostly of generic employment figures - total civilian labor force, total employed, total unemployed. These are not specific to food systems. Component #3 has a curious collection of median earnings variables and ‘per farm’ variables like acres per farm, income per farm, and local and direct-to-consumer sales. Component #4 does not represent much unique variance, and looks like a grab bag of variables.

A couple of early takeaways here are that raw figures tied to population probably shouldn’t be mixed with other variables like proportions. We could try normalizing all the variables so that raw figures are not disproportionately weighted, but it might make more sense to avoid raw counts and dollar amounts entirely.
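A sketch of the normalization idea: z-score every numeric column so raw counts and dollar figures sit on a common scale. The helper name is hypothetical; the same `across()` call could be applied directly to the imputed `imp` data frame above.

```r
library(dplyr)

# Hypothetical helper: z-score each numeric column (mean 0, sd 1) so
# population-driven raw figures don't dominate proportions
z_score_all <- function(df) {
  df %>%
    mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))
}
```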


7 References

Stekhoven, Daniel J., and Peter Bühlmann. 2012. “MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18. https://doi.org/10.1093/bioinformatics/btr597.